The 10 longest words with frequency>1, ordered by length
Length | Frequency | Word |
---|---|---|
98 | 98 | දේවාක්ෂර |
90 | 90 | ක්ෂීර |
73 | 73 | රටාකෙරෙහිමබලපෑම්ඇතිකරනබැවිනික්ලමථපාලනඋපක්රමභාවිතකිරීමක්ලමථයගොදුරුවනතුරුම |
72 | 72 | 東北電力によると11日午後5時時点で青森県・秋田県・岩手県全域、および山形県・宮城県のほぼ全域が停電し、東北地方だけで約440万世帯が停電した。 |
64 | 64 | 宮城県警によると仙台市若林区荒浜の沿岸地区で津波に巻き込まれ溺死したとみられる200~300人の遺体が発見されたと報じられている |
53 | 53 | ගම්භාර,දයියණ්ඩ,වීරමුණ්ඩ,දැඩිමුණ්ඩ,අයියනායක,කතරගම,සුමන |
49 | 49 | භේරුණ්ඩ-මකර-ගජසිංහ-ඇත්කඳළිහිණි-කිඳුරු-සරපෙන්දිරූප |
43 | 43 | නිර්මාණයකරනලදබෞද්ධචිත්රබෞද්ධයෝමහත්සේඅගයකලහ |
41 | 41 | ජලය,කාබන්ඩයෝක්සයිඩ්,හයිට්රජන්,නයිට්රජන් |
41 | 41 | සීගිරිය,හිඳගල,දිඹුලාගල,මිහින්තලය,මහියංගනය |
The table shows the longest words in the corpus that occur at least twice. Requiring a minimum frequency of 2 reduces the chance that misprinted words appear in the list.
Surprisingly, the longest word is not much longer than the second longest. This, again, argues for correct preprocessing.
With correct preprocessing, the longest words are genuine words. In many cases they belong to specific topics that tend to produce long words.
With poor preprocessing, non-word strings appear in the list instead.
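The check described above can be sketched in a few lines of Python. This is only an illustration, not the method used to build the table: the function names `longest_words` and `top_gap_ratio` and the dictionary input format are assumptions introduced here. Length is counted in Unicode code points, which is how Python's `len` measures a string.

```python
def longest_words(word_freqs, min_freq=2, limit=10):
    """Return (length, freq, word) tuples for the longest words
    that occur at least min_freq times, longest first."""
    rows = [(len(word), freq, word)
            for word, freq in word_freqs.items()
            if freq >= min_freq]
    # Sort by length only, descending; ties keep their input order.
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:limit]


def top_gap_ratio(rows):
    """Ratio between the lengths of the longest and second-longest word.
    Returns None when fewer than two rows are available."""
    if len(rows) < 2:
        return None
    return rows[0][0] / rows[1][0]


if __name__ == "__main__":
    # Hypothetical toy frequencies, not taken from the corpus.
    toy = {"a": 5, "abcd": 2, "abcdefgh": 3, "x" * 40: 1}
    rows = longest_words(toy)
    print(rows)
    print(top_gap_ratio(rows))
```

A ratio close to 1 matches the observation that the longest word is not much longer than the second one. A very large ratio would be a reason to inspect the top entry by hand for a preprocessing error. The 40-character string is excluded here because it occurs only once.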
The length of the longest words clearly depends on language and corpus size.
SELECT char_length(word) AS le, freq, word FROM words WHERE freq > 1 ORDER BY le DESC LIMIT 10;
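To try the query without a database server, the following sketch runs an equivalent statement against an in-memory SQLite database. Several things here are assumptions: the `words(word, freq)` schema is inferred from the query itself, the helper name `longest_words_sql` and the sample rows are hypothetical, and SQLite's `length()` stands in for `char_length()`, which SQLite does not provide. For text values, both count characters rather than bytes.

```python
import sqlite3


def longest_words_sql(rows, limit=10):
    """Load (word, freq) pairs into a temporary table and return the
    longest words with freq > 1 as (length, freq, word) tuples."""
    conn = sqlite3.connect(":memory:")
    try:
        # Minimal schema inferred from the columns the query uses.
        conn.execute("CREATE TABLE words (word TEXT, freq INTEGER)")
        conn.executemany(
            "INSERT INTO words (word, freq) VALUES (?, ?)", rows
        )
        cursor = conn.execute(
            "SELECT length(word) AS le, freq, word FROM words "
            "WHERE freq > 1 ORDER BY le DESC LIMIT ?",
            (limit,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical sample rows; the last one is filtered out (freq = 1).
    sample = [("aa", 3), ("bbbbb", 2), ("dddd", 90), ("cccccccccc", 1)]
    for length, freq, word in longest_words_sql(sample):
        print(length, freq, word)
```

Words of equal length come back in an unspecified order, because the query sorts by length only. Adding a second sort key such as `freq DESC` would make the output deterministic.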
How does the length of the longest words increase with corpus size?
3.2.3.1 Longest Words in top-1000 by length